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ABSTRACT 

Emerging data analysis involves the ingestion and explo- 
ration of new data sets, application of complex functions, 
and frequent query revisions based on observing prior query 
answers. We call this new type of analysis evolutionary an- 
alytics and identify its properties. This type of analysis is 
not well represented by current benchmark workloads. In 
this paper, we present a workload and identify several met- 
rics to test system support for evolutionary analytics. Along 
with our metrics, we present methodologies for running the 
workload that capture this analytical scenario. 



1. INTRODUCTION 

A new analytical landscape has emerged, exemplified by 
the popularity of "big data" systems such as Hadoop as well as 
the recently added support for big data processing by all ma- 
jor data warehouse vendors [8|10| . Data volumes are growing 
rapidly and log files are an important data source, e.g., so- 
cial media or sensor data. Queries are often exploratory in 
nature, and system-facilitated data exploration has been pro- 
posed in 4,11,21,24. Given this scenario, new requirements 
for analysis have been noted in [s] , including the need to ac- 
cess "disparate, decentralized data" [6|. Analysis frequently 
includes complex processing methods such as user defined 
functions (UDFs), e.g., [2l[9| |15| , created by expert users for 
domain-specific processing needs. 

In this new analytical setting, data analysts and data sci- 
entists are becoming increasingly important to businesses [sl 
|19| . Because analysts are tasked with finding value within 
their growing data sources, the speed at which an analyst 
can iterate through successive investigations to gain insight 
is crucial [T^. To measure system performance, there is a 
need for a workload and metrics to capture this emerging 
type of analytics. It is important to understand the features 
of this new type of workload and effective ways to evaluate 
system performance in this space. We term this scenario 
evolutionary analytics and identify the following three im- 
portant characteristics of evolutionary analytics that are not 
captured by existing benchmarks. 

1. Query Evolution. Queries are exploratory and evolve over 
time. A query may go through multiple evolutions (ver- 
sions) whereby an analyst iteratively formulates, tests, 
and refines hypotheses during investigation. Query re- 
visions appear as a sequence of mutations to the orig- 
inal query, and this temporal nature is a key feature. 
While traditional interactive OLAP may perform oper- 
ations such as roll-up or drill-down to slightly modify the 
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Figure 1: System metrics for evolutionary analytics 

query, revisions in exploratory analysis can include more 
types of changes to the query and a longer sequence of 
changes, as we define in Section |3.1[ Typical revisions 
are minor refinements as well as more significant changes 
such as augmented functions, addition (or removal) of 
sub-queries, or incorporating new data sources to obtain 
richer answers. 

2. Data Evolution. Queries may incorporate new data from 
external sources such as raw logs or local data files. New 
data sources should be easily ingested or accessible for 
use during query processing, and these data sources may 
have evolving schemas. A formal ETL (extract, trans- 
form, load) project for a data warehouse can have a very 
high cost in dollars and design time. Enabling access to 
diverse data sources via ETL on-the-fiy is a key feature 
of this new analytical environment. 

3. User Evolution. A fiexible and accessible system should 
enable new users to get started posing queries testing dif- 
ferent hypotheses, potentially over old or new data sets. 
New users in the system arrive less frequently than query 
revisions, and their queries do not closely resemble an- 
other user's queries. 

In this paper, we propose a workload with these features 
and metrics to test how well a system supports them. Query 
response time is a primary metric but it is useful to under- 
stand system performance for other metrics as well. For ex- 
ample, response time may hide several other system over- 
heads, such as the overhead to tune the physical design (if 
this happens online as queries get executed) or the cost to 
load data (if it has to be ETL'ed on-the-fly). By separating 
out these overheads in different metrics, we can see where 
each system excels and also understand how to develop cross- 
bred systems that combine their best features. 

Figure [l] shows our proposed metrics as dimensions (al- 
though not completely orthogonal). Query response time 



indicates how quickly analysts can arrive at answers when 
testing hypotheses. Tuning overhead represents the time ex- 
pended for physical design tuning, i.e., creating indexes and 
materialized views, to improve query processing speed. Data 
arrival to query time indicates the time until newly arrived 
data is query-able. Storage in terabytes indicates the over- 
head for all data and auxiliary data structures (indexes and 
views). Cost in dollars represents the system cost to process 
the workload. Along each dimension, we indicate the rela- 
tive performance of a traditional data warehouse (dw) and 
a Hadoop system. This illustration shows how to compare 
and contrast different systems using our proposed metrics. 
We present experimental results for four data systems using 
these metrics in Section [4] 

In this paper we make the following contributions. 

• We define evolutionary analytics along with our notions 
of query evolution, data evolution, and user evolution, 
and introduce a workload with these properties. 

• We propose relevant metrics for evolutionary analytics 
and describe their tradeoffs. 

• We show how metrics can be used to guide the design of 
hybrid systems that target the best features of specialized 
processing engines. 

2. WORKLOAD CHARACTERISTICS AND 
SYSTEM METRICS 

In this section we first describe our workload properties 
and contrast with other benchmarks, then we describe our 
metrics to test system support for evolutionary analytics and 
highlight the various tradeoffs for each metric. 

2.1 Workload Characteristics 

Data analysis queries dealing with low-structured or log 
data must often perform data extraction tasks as well as an- 
alytical tasks, including the application of machine learning 
algorithms. Some recent examples of such queries are given 
in [18 22 . These queries reference Twitter data and static 
data such as IMDB or historical business sales data. They 
use several UDFs which perform sentiment analysis and clas- 
sification tasks. One query infers movie rating trends for two 
consecutive months, and the other computes the impact of a 
marketing campaign in different sales regions. These queries 
are representative of common tasks in current data process- 
ing. 

Given our previously described workload needs and exam- 
ples from recent data analysis tasks, we determine the fol- 
lowing desirable characteristics for the query workload of the 
benchmark. 

• Significant query complexity, including common UDFs, 
performing non-trivial analysis tasks to gain insights and 
find value within unproven data sources. 

• Several successive versions for each query, representing 
data exploration during hypothesis testing and query re- 
finement. 

• Access realistic data sets during query processing. These 
should include raw logs and static/historical data sets. 

Current analytical benchmarks such as TPC-H and TPC- 
DS [25] do not adequately capture this type of analytical 
workload. For instance, TPC-DS queries focus on known 



data and known reporting tasks, with a carefully designed, 
fixed schema for a data warehouse. This is not always pos- 
sible in the current analytical scenario, as the use-case may 
not afford the up- front, top-down design of a traditional data 
warehouse. In contrast to query evolution where a query goes 
through an ordered sequence of mutations, the set of report- 
ing queries in TPC-DS represent independent tasks where 
the ordering of one query is not dependent on the previously 
executed query. In contrast to data evolution, TPC-DS main- 
tenance workloads reflect table inserts from its counterpart 
OLTP database, but they do not reflect a growing log or 
arrival of a new data source. 

2.2 System Metrics 

We propose the following metrics to evaluate a system for 
evolutionary analytics. 

Query response time. Query performance is a key metric 
of the benchmark, and measures total workload execution 
time. This metric serves as the primary indicator of how 
well a system is able to support the workload features and 
process the workload efficiently. 

Tuning overhead. Physical design tuning can greatly im- 
prove query performance. Tuning might be considered offline 
during a system maintenance window or online during work- 
load processing. This metric reports the cumulative time 
spent on tuning, which is the time spent to run a tuning tool 
and the time to materialize all indexes and views. 

Data arrival to query time. This metric reports the time 
until newly arrived data is available to query. Data prepa- 
ration is an atomic operation that enables the data to be 
accessed by a query. This may include the schema definition 
such as a CREATE table statement and a LOAD operation. 

Storage size. This metric indicates the total storage re- 
quired in terabytes. Total storage includes that required for 
all base data, and all indexes and materialized views. Stor- 
age size can be asymmetrical even for base data, considering 
some systems replicate data by design (e.g., Hadoop) whereas 
other systems might not. 

Monetary cost. This metric indicates the total system cost 
for query processing and data storage. For simplicity, in this 
work we use dollar cost to include only machine time and 
storage cost. A better cost metric could be total cost of own- 
ership (TCO), as that cost includes system administration 
cost as well as hardware costs. The cost value is useful when 
considering tradeoffs that are tolerable for a given environ- 
ment. This may be particularly important for exploratory 
data analysis since the return on investment may not be ini- 
tially known. 

2.2.1 Metric tradeoffs 

Our metrics can be used to understand the various trade- 
offs to consider for system design. Previous studies [2^ have 
considered load times and query response time. Here we in- 
troduce additional metrics and show how they interact with 
each other. Clearly response time interacts with all of the 
other metrics of data loading, physical design tuning, stor- 
age space, and cost. Reducing workload execution time can 



be achieved through a combination of tradeoffs among the 
other metrics. 

For example, tuning overhead impacts both query response 
time and storage size. A good physical design can consume 
multiple times the size of the base data, but may reduce 
workload cost dramatically. Due to their size, the choice of 
indexes and views will also appear as a tradeoff along the 
storage metric. Loading may require data cleaning, trans- 
formation, and copying/storing the data, which is a typical 



evolutionary analytics typically go through several revisions. 

Specifically, we commonly observed the following 4 types 

of changes during query revisions from a sampling of |23l|25l 



ETL task in a data warehouse. In contrast, using Hive 24 
requires only the schema definition to be provided before a 
query can access the data. This presents a tradeoff between 
query response time and data load time. 

The cost metric leads to interesting tradeoffs for sys- 
tem design. In particular, the advent of the cloud en- 
ables pay-as-you-go performance, allowing for a rich set of 
choices for query processing. For example, Hadoop [TIIt], 
databases 116, and recently even petabyte-scale data ware- 
houses (e.g.,Redshift fij) are all available on-demand. More- 
over, a mixture of systems may be used for query processing 
as we show later. 

The importance of each metric may be weighed differently 
for a particular environment. The purpose of including all 
five of them is to help understand the impact of various trade- 
offs in order to guide system design. Next we describe the 
specifics of our workload and how it can be evaluated us- 
ing these metrics to highlight the relevant aspects of system 
performance. 

3. THE WORKLOAD 

Our workload considers 8 hypothetical analysts who write 
queries for marketing scenarios involving restaurants using 
social media data and static data. For social media data we 
use a sample of the Twitter data stream and user check-in 
data from Foursquare. For static data we include a Land- 
marks data set (landmark locations). Each analyst poses 
one query which is then revised multiple times. There are 
4 versions of each query, representing the original query and 
3 subsequent revisions. Next we define the types of changes 
allowed for each revision, and then provide a workload that 
uses these changes. 

3.1 Query building blocks 

Queries that evolve during exploratory data analysis may 
follow certain patterns of common changes. As an analyst 
revises a query, she may tweak the selectivity to produce 
greater or fewer answers, include additional data sources for 
stronger evidence of hypothesis, add a UDF to perform a spe- 
cialized processing function, or refine the results by including 
or removing a query sub-goal as more is learned about the 
data after each query revision. 

To make these changes concrete, we evaluated complex an- 
alytical queries from several sources to find evidence of the 
manner in which queries evolve. The TPC-DS [25] workload 
includes 4 interactive OLAP queries that go through 2 revi- 
sions each. Taverna [23] queries on MyExperiment [Tt] are 
scientific queries that retain all of their revisions. Each of 
the top 10 most-downloaded Taverna queries had 2-11 revi- 
sions. Yahoo! Pipes [26] has many versions of user queries 
over open-access web data, with more than 99 data sources. 
Queries in Pipes are easily clone-able and modified by any 
user, and in one instance we observed a query with more 
than 49 revisions. These observations suggest that queries in 



We note these changes are not mutually exclusive nor 
laustive but representative. 
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Figure 2: Dimensions of change between query ver- 
sions 

(P) Parameters: The query parameters are modified to 
obtain slightly different results (Figure[2l-'). For example, the 
analyst may alter a selection predicate or a top-fc value to 
allow more or less data in the output. 

(L) Logs: An analyst may make use of an additional data 
source in the query to obtain richer results (Figure [2]L) . 

(U) UDFs: An analyst may add or replace a set of opera- 
tions in the query with a specialized UDF. 

(G) Sub-Goals: Typically, an analyst writes several sub- 
queries that each achieves a single goal and then joins these 
to obtain the final output (Figure [2]G1). A revision may add 
or remove a sub-goal. 

These four dimensions {P,L,U,G} serve as our evolving 
query building blocks. For each query revision, the changes 
from one version to the next are expressed by one or more of 
these dimensions. 

3.2 Queries 

We give a high-level description of each query scenario in 
Table[l]left column, and the right column specifies the change 
from one version to the next in terms of our query building 
blocks. For instance, let Q represent the analyst's first ver- 
sion of the query. Then each subsequent version (i.e., 2,3,4) is 
represented by indicating the dimensions that were changed 
during each revision in the following way: 

Q^{P,L,G}^{P}^{P,G} (1) 

To illustrate this process, we start with Example [l] as the 
first version of Analyst I's query in Table [l] This version is 
indicated by Q above. Analyst 1 first desires to find users 
who like wine, are affluent, and have many good friends. 

Example 1. (a): EXTRACT user from Twitter log. 
Apply UDF-CLASSIFY-WINE-SCORE on each user tweet 
to obtain a wine-score. Groupby user, compute a wine- 
sentiment- score for each user. 

(h): From Twitter log, apply UDAF-CLASSIFY- 
AFFLUENT on tweets to classify a user as affluent or not. 

(c): From Twitter log, create social network between every 
user pair using tweet source and dest. GROUPBY user pair 
in social network, count tweets. Assign friendship- strength- 
score to each user pair. 

JOIN (a),(b), and (c). Threshold based on wine- sentiment- 
score, friendship-strength-score. 

Next the analyst wants to find more evidence that the 
user likes wine. She revises the query by changing {P,L,G}, 
adding two new data sources {L} (Foursquare and Land- 
marks), a new sub-goal {G} that computes a checkin-count 
for users who go to wine places, and decreases the threshold 



Table 1: Eight analyst marketing scenarios, along with the dimensions modified during each of the 4 evolutions 



Analystl wants to identify a number of "wine lovers" to send them a coupon for a now wine 
being introduced in a local region. This evolution investigates ways of finding suitable users to 
whom sending a coupon would have the most impact. 


Q -^ {P, L, G} -^ {P} -^ {P, G} 


Analyst 2 wants to find infiuontial users who visit a lot of restaurants for inclusion in an adver- 
tisement campaign. The evolution of this scenario will focus on increasingly sophisticated ways 
of identifying users who are "foodies". 


Q -s> {L, U, G} -s> {P, G} -^ {P, G} 


Analysts wants to start a gift recommendation service where friends can send a gift certificate 
to a user ui. We want to generate a few restaurant choices based on mi's preferences and his 
friend's preferences. The evolution in this scenario will investigate how to generate a diverse set 
of recommendations that would cater to u and his close set of friends. 


Q -> {P, G} -^ {P, L, G} -^ {G} 


Analyst4 wants to identify a good area to locate a sports bar. The area must have a lot of 
people who like sports and check-in to bars, but the area does not already have too many sports 
bars in relation to other areas. The evolution focuses on identifying a suitable area where there 
is high interest but a low density of sports bars. 


Q ^ {U, G} -^ {L, U, G} -^ {U, G} 


Analysts wants to give restaurant owners a customer poaching tool. For each restaurant r, we 
identify customers who go to a "similar" restaurant in the area but do not visit r. The owner 
of r may use this to target advertisements. The evolutionary nature focuses on determining 
"similar" restaurants and their users. 


Q -^ {L, G} -^ {L, U} -^ {P, G} 


Analyst6 tries to find out if restaurants are losing loyal customers. He wants to identify those 
customers who used to visit more frequently but are now visiting other restaurants in the area 
so that he can send them a coupon to win them back. The evolutionary nature of this scenario 
will focus on how to identify prior active users. 


Q -^ {L, G\ -^ |P, G\ -^ {P, G\ 


Analyst? wants to identify the direct competition for poorly-performing restaurants. He first 
tries to determine if there is a more successful restaurant of similar type in the same area. The 
evolutionary nature focuses on identifying good and bad restaurants in an area, as well as what 
customers like about the menu, food, service, etc. about the successful restaurants in the area. 


Q ^ {L,G} ^ {G} ^ {U,G} 


Analysts wants to recommend a high-end hotel vacation in an area users will like based on their 
known preferences for restaurants, theaters, and luxury items. The evolutionary nature focuses 
on matching user's preferences with the types of businesses in an area. 


Q -> {L, G} ^ {U, G} ^ {P, L, U, G} 



parameter {P} for wine-sentiment-score since she will have 
evidence a user likes wine from 2 data sources. Example |2] 
below describes version 2 of the query. 

Example 2. (d): EXTRACT from Foursquare log. For 
each checkin, obtain the user and restaurant name. Using 
the Landmarks data, filter by checkin to places of type wme- 
bar. Groupby user, compute checkin- count, 
(e): Decrease wine-sentiment-score threshold 
JOIN (a),(b),(c) and (d). Threshold based on new wine- 
sentiment-score in (e), friendship-strength-score. 

Version 3 and 4 of the query are revised similarly but are 
omitted here due to lack of space. A description of all queries 
is provided in the Appendix of the extended version of this 
paper .14j . 

4. RUNNING THE BENCHMARK 

We now present our benchmark methodology for query 
evolution, user evolution, and data evolution and we show 
an example of benchmark results. We consider the initial 
system state to be idle, with no previously loaded data or 
executed queries. 

4.1 Benchmark methodology 

Query evolution. This test will use all analysts 1-8 and all 
query versions from each analyst. (1) From initial system 
state, execute analyst 1 query versions 1 through 4 in suc- 
cession, returning to initial system state before each version. 
(2) From initial system state, execute analyst 1 query ver- 
sions 1 through 4 in succession, without returning to initial 
system state before each version. Compare metrics from (1) 
and (2), and repeat for each remaining analyst. This com- 
parison highlights a system's ability to process any repeating 
tasks from the same user. 



User evolution. This test will use all analysts 1-8 but only 
version 1 of each analyst's query. First, assume some order 
of analysts 1-8. (1) From initial system state, execute each 
analyst's query in the chosen order, returning to initial sys- 
tem state before each query. (2) From initial system state, 
execute each analyst's query in the chosen order, without re- 
turning to initial system state before each query. Compare 
metrics from (1) and (2). This comparison highlights a sys- 
tem's ability to process similar tasks from different users. 

Data evolution. This test will use a single data source, e.g.. 
Twitter log, and the subset of data requested in the first 
step should be a number of columns equal to half of the total 
number of columns in the log schema. The columns should be 
randomly chosen each time. (1) From an initial system state, 
an analyst requests a subset of data from a new data source. 

(2) An analyst requests one additional attribute from the 
data source in (1), in each successive version of the query. (3) 
A new analyst requests a subset of data previously accessed 
by the analyst in (1). (4) Repeat (1), (2), (3) returning to the 
initial state after each query. Compare metrics from (1), (2), 

(3) with (4). This comparison highlights a system's ability 
to access subsets of data from a new data source on demand. 

4.2 Example benchmark results 

Next, we briefly show a sample reporting on the relative 
performance of four data systems using our workload and 
metrics for a user evolution scenario. The experimental setup 
consists of 9 nodes running a widely used commercial parallel 
data warehouse (DW) and 14 nodes running Hadoop. The 
ratio of Hadoop nodes to DW nodes is 1.5 x. The DW and 
Hadoop clusters are independent, and nodes are connected 
with 1 GbE. Each node has two 2.4 GHz xeon CPUs and 
a local 2 TB disk. In this test, our data includes a 1 TB 
Foursquare log, a 1 TB Twitter log and 12 GB Landmarks 
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Figure 3: Figure shows a sample reporting of 4 data systems on a user evolution scenario 



log. We use Hive [24| to execute our queries on the Hadoop 
system. Since all systems utilize the base data stored in 
Hadoop, we omit this from the storage metric. 

Figurelslreports the results for the user evolution scenario. 
HADOOP corresponds to a Hadoop-only execution of the 
query. DW executes the query on the DW but uses Hadoop 
as an ETL tool to extract the subset of data required by the 
query. MV-HDP corresponds to a system that we developed 
in [iSi that rewrites Hadoop queries based on opportunistic 
materialistic views left behind from prior execution runs. MS 
is an implementation of a multi-store query optimizer (sim- 
ilar to [22]) that uses both Hadoop and a data warehouse 
together to execute each query. 

It can be seen from the figure that reporting on the 5 met- 
rics exposes their tradeoffs, which are not easily captured 
when reporting just on the query execution time. Since 
Hadoop performs ETL on the fly, query performance is quite 
poor compared to the DW. On the other hand, the superior 
performance of the DW is offset by the high cost of loading 
the data into the data warehouse. Both MV-HDP and MS 
show tradeoffs that reduce query response time. HADOOP 
and MV-HDP do not incur any tuning overhead whereas DW 
and MS require a tuning phase to provide good performance. 
We used Amazon EC2 and Redshift [I] pricing to approxi- 
mate the dollar cost of the machines and storage (based on 
machines similar to our clusters). The cost values show that 
HADOOP is far cheaper than DW while MS is cheaper than 
both, and MV-HDP has the lowest cost. Finally, it can be 
seen that MV-HDP incurs a significant storage overhead by 
retaining results as opportunistic views from all the prior ex- 
ecutions runs. The tradeoff with storage size improves query 
response time compared to HADOOP. 



5. DISCUSSION 

In the new analytical space, the key question is how to 
design systems to address emerging needs. The continued 
popularity of Hadoop and data warehouses notwithstanding, 
these are only suitable when the required use-case matches 
either of their starkly different characteristics. One focuses 
on being able to query the data right away, tolerating lesser 
performance. The other focuses on performance at the ex- 
pense of significant delay in being able to query the data. 
These systems represent two ends of a spectrum, and the 
infiux of so many new data processing systems shows that 
these two distinct choices are not meeting all current needs. 

In this paper, our metrics highlight the tradeoffs among 
many design choices and the metrics can be used to guide 
system development. For example, we show 2 systems that 
remedy one dimension by shifting the tradeoff with another 
dimension. With MV-HDP we show that increased storage 
leads to better performance than Hadoop. With MS we show 
that one can remedy the loading time of a data warehouse 
to an extent by sacrificing some of DW performance. An 
interesting further research direction is to leverage the best 



properties of several systems to create hybrid systems. 
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APPENDIX 

A. QUERY DESCRIPTIONS 

In this section we describe each of the queries. These 
queries use three datasets: a Twitter data stream of user 
tweets, a Foursquare (4SQ) data stream of user checkins, and 
a Landmarks log of locations and their types. The identity of 
users is common across the Twitter and 4SQ logs, while iden- 
tity of locations is common across the 4SQ and Landmarks 
logs. All logs are stored as JSON text. 

We present 8 Analyst's queries with 4 versions each. For 
each analyst, we state a high-level goal of what the analyst 
is trying to achieve, as well as 4 query versions toward the 
stated goal. Each query version modifies the previous ver- 
sion. For each query, we describe the task rather than pro- 
vide an implementation in particular language since many 
are possible, e.g., SQL, HiveQL, Pig, Java, etc. 

During a typical exploration, and analyst may spend a 
lot of time identifying the data distribution to get an un- 
derstanding of where the density and sparsity lies. For this 
reason, the queries below have parameters indicated by an 
underline that an analyst would modify in order to obtain a 
representative answer set. Choosing an appropriate value or 
interpretation of the underlined parts of the queries is a nec- 
essary step an analyst performs. This may require a trial and 
error process resulting in additional versions of the queries. 
We leave these values unspecified as they are a function of 
the real-world datasets used. 

Finally, it is important to note that the queries are not rigid 
interpretations of a goal, but rather one approach toward an- 
swering a question. For example, there may be several ways 
to interpret what it means to be "good" friends in a social 
network. Furthermore, even for a particular interpretation, 
there may be several ways to express it as a query. Hence, 
other variations of a query are possible. 

User Defined Functions referenced by queries 

The following queries reference multiple UDFs, for which 
we provide a brief description below. Text classifiers can 
be implemented using a bag-of-words method. For exam- 
ple, classes such as COFFEE-drinker may include the words 
{coffee, espresso, latte, french press) while wine-lover may 
include (cafeernei, vineyard, merlot, chardonnay). By accept- 
ing a bag of words as an argument, these UDFs are easily 
reusable. However, this does not exclude other classification 
methods. UDAFs perform a groupby on a key and apply 
similar classification on the elements of a group. 

1. UDF-CLASSIFY- WINE-SCORE: Input is text and out- 
put is wine-score indicating a strong presence of wine- 
terms. 

2. UDF-CLASSIFY-FOOD-SCORE: Input is text and out- 
put is food-sentiment-score indicating a strong presence 
of food-terms. 

3. UDF-GRID-CELL: Input is lat-lon coordinates and grid 
resolution, and output is a grid-cell number. 

4. UDF-CLASSIFY-BEER-SCORE: Input is text and out- 
put is beer-score indicating a strong presence of beer- 
terms. 

5. UDF-MENU-SIMILARITY: Input is two lists of menu 
items and output is a score indicating the similarity of 
the lists. 



6. UDF-NLP-ENTITITY-SENTIMENT: Input is text and 
output is the entities extracted from the text with a 
sentiment-score for each entity. 

7. UDF-CLASSIFY-LUXURY-SCORE: Input is text and 
output is a binary value indicating if the text concerns 
luxury items. 

8. UDF-SENTIMENT: Input is text and output is a 
sentiment-score expressing positive or negative sentiment 
with the score indicating the strength of the sentiment. 

9. UDAF-CLASSIFY-AFFLUENT: Input is all text from a 
given user and output is a binary value indicating if the 
user is affluent or not. 

10. UDAF-CLASSIFY-SPORTS: Input is aU text from a 
given user and output is a binary value indicating if the 
user is interested in sports or not. 

A.l Analyst 1 

Analyst 1 wants to identify a number of "wine lovers" to 
send them a coupon for a new wine being introduced in a local 
region. This evolution investigates ways of finding suitable 
users to whom sending a coupon would have the most impact. 

A.1.1 Analyst 1 , Version 1 

• Analyst goal: Find users that like wine, have strong 
friendships, and are affluent. 

• Query: From Twitter, apply UDF-CLASSIFY- WINE- 
SCORE on each user's tweets and groupby user to pro- 
duce wine-sentiment-score for each user. Threshold on 
wine-sentiment-score above xi . 

From Twitter, compute all pairs («i,M2) of users that 
communicate with each other, assigning each pair a 
friendship-strength-score based on the number of times 
they communicate. Threshold on friendship-strength- 
score above X2- 

From Twitter, apply UDAF-CLASSIFY-AFFLUENT on 
users and their tweets. 

Join results by user. 
A. 1.2 Analyst 1 , Version 2 

• Analyst goal: Next, consider users to be wine-lovers if 
they checkin to many wine places. 

• Query: From previous version, reduce wine-sentiment- 
score threshold to x'2 since now there will be additional 
evidence a user likes wine. 

From 4SQ, identify places that users checkin. Join with 
places in Landmarks. Select users that checkin to places 
of type wine-bar. For each user, count the number of 
checkins. Threshold on checkin-count above X3. 

Join these with users from previous version. 
A. 1.3 Analyst 1 , Version 3 

• Analyst goal: Now find users that are also in the San 
Francisco area as well as prolific on Twitter. 

• Query: From previous version, select users local to San 
Francisco. Threshold on tweet-count above X4. Adjust 
xi, X2, X3 appropriately to produce "enough" answers. 



A.1.4 Analyst 1 , Version 4 



A. 3.1 Analyst 3, Version I 



• Analyst goal: Finally, require that a user's friends must 
also visit wine-places. 

• Query: For user pairs {ui,U2}, threshold on U2 checkin- 
score above x^. For each user mi, count the number of 
friends with checkin-count above threshold. Retain ui 
if count above a^. Join these with users from previous 
version. 

A.2 Analyst 2 

Analyst 2 wants to find influential users who visit a lot of 
restaurants for inclusion in an advertisement campaign. The 
evolution of this scenario will focus on increasingly sophisti- 
cated ways of identifying users who are "foodies". 

A.2.1 Analyst 2, Version 1 

• Analyst goal: Find users who frequently visit restaurants. 

• Query: From 4SQ, identify places that users checkin. Join 
with places in Landmark log. Select users that checkin to 
places of type restaurant. For each user, count the num- 
ber of times they checkin to a place of type restaurant. 
Compute the normalized-count based on the maximum 
count across all users. Threshold on normalized-count 
above xi. 

A.2. 2 Analyst 2, Version 2 

• Analyst goal: Additionally, user also likes food if they 
talk positively about food. 

• Query: From Twitter, apply UDF-CLASSIFY-FOOD- 
SCORE on each user's tweets and groupby user to pro- 
duce food-sentiment-score for each user. Threshold on 
food-sentiment-score above x^- 

Join these users with users in previous version. 

A.2.3 Analyst 2, Version 3 

• Analyst goal: Further define that a user likes food if they 
dine at many different types of restaurants. 

• Query: Revise previous version by counting the number 
of times a user has visited each distinct type of restaurant. 
Select users who have visited xz distinct restaurant types 
at least xa_ times. 

A.2. 4 Analyst 2, Version 4 

• Analyst goal: Finally, require that these users do not 
frequently visit restaurants with low ratings. 

• Query: From previous version, compute the percentage of 
each user's checkins to restaurants with ratings less than 
Xf,. Threshold on percent below x^. 

A.3 Analyst 3 

Analysts wants to start a gift recommendation service 
where friends can send a gift certificate to a user u. We 
want to generate a few restaurant choices based on u's pref- 
erences and m's friend's preferences. The evolution in this 
scenario will investigate how to generate a diverse set of rec- 
ommendations that would cater to u, and «'s close set of 
friends. 



• Analyst goal: For each user u, identify those restaurants 
that It's good friends frequently visit. 

• Query: From Twitter, compute all pairs {ui,U2) of users 
that communicate with each other, assigning each pair a 
friendship-strength-score based on the number of times 
they communicate. Threshold on friendship-strength- 
score above xj_. 

From 4SQ, identify places that users checkin. Join with 
places in Landmarks. Select users that checkin to places 
of type restaurant. 

For each user ui, find all the restaurants that her friends 
M2 have visited. For each restaurant, count the number 
of checkins. Threshold on count above X2- 

A.3. 2 Analyst 3, Version 2 

• Analyst goal: Next, only consider users that have friends 
in the same area as well as other friends in common. 

• Query: Revise the previous version by redefining what it 
means to be good friends. From Twitter, recompute all 
pairs (mi , M2) of users that live in the same area, and have 
a friendship-strength-score above a^. Additionally, a user 
pair {ui, U2) are said to be good friends if they have more 
than X4_ friends in common. 

A.3. 3 Analyst 3, Version 3 

• Analyst goal: Next, identify only those restaurants that 
are the same type as a user's favorite restaurant. 

• Query: From 4SQ, for each user «i, find favorite restau- 
rant type by counting the number of checkins to each 
restaurant, and select the restaurant with the max num- 
ber of checkins as ui's favorite restaurant r. 

Join with Landmarks to obtain r's type. 

From the previous version, select only those restaurants 

for ui that belong to the same type as mi 's favorite type. 

A.3. 4 Analyst 3, Version 4 

• Analyst goal: Finally, find additional restaurants that are 
similar to those visited by a user's friends. 

• Query: From 4SQ, for all restaurant pairs (ri,r2), count 
the number of users that have visited both restaurants. 
Threshold on count above x^. All remaining pairs (ri, r2) 
are considered to be similar since they have many com- 
mon customers. For each user ui, suggest r2 if r\ is a 
restaurant frequently visited by «i's friends. 

A.4 Analyst 4 

Analyst4 wants to identify a good area to locate a sports 
bar. The area must have a lot of people who like sports and 
check-in to bars, but the area does not already have too many 
sports bars in relation to other areas. The evolution focuses 
on identifying a suitable area where there is high interest but 
a low density of sports bars. 

A.4.1 Analyst 4, Version 1 

• Analyst goal: Find users who like beer and where they 
live. 



• Query: From 4SQ, identify users and their location that 
frequently mention the word "beer" in their text. For 
each user, count the occurrences of the word. Threshold 
on count above a^l_. 

A.4.2 Analyst 4, Version 2 

• Analyst goal: Next, find areas where there are many beer 
lovers. 

• Query: From the previous version, use UDF-GRID-CELL 
to map user locations to a grid cell. Count number of 
users in each grid cell. Threshold on count above X2- 

A.4.3 Analyst 4, Version 3 

• Analyst goal: Next, find areas with many users that like 
beer and sports but do not have many sports bars. 

• Query: From Twitter, apply UDAF-CLASSIFY- 
SPORTS on users and their tweets. Then apply a UDF- 
CLASSIFY-BEER-SCORE to better identify users that 
like beer, and produce a beer-score for each user. Join 
sports and beer users. Threshold on beer-score above 
X3. Next, apply UDF-GRID-CELL to map user locations 
to a grid cell. Count number of users in each grid cell. 
Threshold on count above X£. 

From Landmarks, obtain restaurant name, type and lo- 
cation. Select places that are type equal to sports bar. 
Next, apply UDF-GRID-CELL to map place locations to 
a grid cell. Count number of restaurants in each grid cell. 
Threshold on count below X5_. 

Join grid cells from user locations and sports bar loca- 
tions. 

AAA Analyst 4, Version 4 

• Analyst goal: Finally, find area with high user interest 
but few popular sports bars relative to the number of 
users. 

• Query: From 4SQ, identify places that users checkin. 
Join with places in Landmarks. Select places that are 
type equal to sports bar. For each place, count the num- 
ber of checkins. Threshold on count above xq . 

Next, apply UDF-GRID-CELL to map place locations to 
a grid cell. Count the number of places per grid cell. 
Join this with the grid cells from previous version. 
Threshold on ratio of user to sports bars count above 

Xt_. 

A.5 Analyst 5 

Analysts wants to give restaurant owners a customer 
poaching tool. For each restaurant r, we identify customers 
who go to a "similar" restaurant in the area but do not visit 
r. The owner of r may use this to target advertisements. The 
evolutionary nature focuses on determining "similar" restau- 
rants and their users. 

A.5.1 Analyst 5, Version 1 

• Analyst goal: Find similar restaurants based the overlap 
of users that checkin to each place. 

• Query: From 4SQ, for all restaurant pairs {ri,r2), count 
the number of users that have visited both restaurants. 
Threshold on count above x\. 



A. 5.2 Analyst 5, Version 2 

• Analyst goal: Next, find restaurants that are similar as 
indicated by a user or the user's friends frequently visiting 
the same places. 

• Query: From Twitter, compute all pairs {ui,U2} of users 
that communicate with each other, assigning each pair a 
friendship-strength-score based on the number of times 
they communicate. Threshold on friendship-strength- 
score above X2 - 

From 4SQ, for all restaurant pairs (ri, r2), count the num- 
ber of users that have visited both restaurants, as well as 
the number of times a user ui has visited ri and one of 
their friends U2 has visited r2 . Threshold on count above 

X3. 

A. 5. 3 Analyst 5, Version 3 

• Analyst goal: Next, find restaurant pairs that are also 
similar based on the similarity of their menus. 

• Query: From Landmarks, create restaurant pairs {ri,r2) 
that have the same zip code and type. For each 
pair, apply UDF-MENU-SIMILARITY to obtain menu- 
similarity-score. Threshold on menu-similarity-score 
above X4. 

Join pairs (ri,r2) with pairs (ri,r2) from the previous 
version. 

A. 5 A Analyst 5, Version 4 

• Analyst goal: Finally, find users that visit one restaurant 
but not a similar restaurant. 

• Query: From 4SQ, for each restaurant r, identify the users 
that have visited r and the count of times they have vis- 
ited. For each restaurant pair {ri,r2) from the previous 
version, select users u that have visited ri more than jrs 
times and visited r2 less than xq times. 

A.6 Analyst 6 

Analysts tries to find out if restaurants are losing loyal 
customers. He wants to identify those customers who used 
to visit more frequently but are now visiting other restaurants 
in the area so that he can send them a coupon to win them 
back. The evolutionary nature of this scenario will focus on 
how to identify prior active users. 

A.6.1 Analyst 6, Version 1 

• Analyst goal: For each restaurant, identify other restau- 
rants with the same zip code and type that are less pop- 
ular. 

• Query: From Landmarks, create restaurant pairs {ri,r2) 
that have the same zip code and type, and r2 has a much 
lower checkin count than ri. 

A.6. 2 Analyst 6, Version 2 

• Analyst goal: Now, identify restaurants that have lately 
become less popular. 

• Query: From 4SQ, identify places that users checkin. 
Join with places in Landmarks that have type restau- 
rant. For each restaurant, compute the average number 
of checkins per month in the last xi^ months and the num- 
ber of checkins in the last 1 month. 



Threshold on the ratio of recent checkins to historical 
average checkins below X2_- 

A.6.3 Analyst 6, Version 3 

• Analyst goal: Next, find users that stopped visiting those 
restaurants. 

• Query: For restaurant r identified as becoming less pop- 
ular in the previous version, and a user u that visited r, 
compute the average number of checkins per month by 
user u in the last a^ months and the number of check- 
ins by user u in the last 1 month. Compute the ratio of 
recent checkins to historical average checkins 
Threshold on ratio below X4. 

A.6.4 Analyst 6, Version 4 

• Analyst goal: Finally, find users that no longer frequent 
a particular restaurant but still visit other restaurants in 
the same area. 

• Query: From 4SQ, for each user u, count the number of 
checkins by zip code. Threshold on count above X5. 

For each less popular restaurant r identified in previous 
version, retain u only if ii still frequently visits restaurants 
in the same zip code as r. 

A.7 Analyst 7 

Analyst? wants to identify the direct competition for 
poorly-performing restaurants. He first tries to determine 
if there is a more successful restaurant of similar type in the 
same area. The evolutionary nature focuses on identifying 
good and bad restaurants in an area, as well as what cus- 
tomers like about the menu, food, service, etc. about the 
successful restaurants in the area. 

A. 7. 1 Analyst 7, Version 1 

• Analyst goal: For each zip code, identify good and bad 
restaurants. 

• Query: From Landmarks, identify places that are restau- 
rants, and apply UDF-SENTIMENT on the restaurant 
comments to obtain a sentiment-score. For each zip code, 
retain restaurants with sentiment-score above xi_ or below 
X2 as good and bad restaurants. 

A.7. 2 Analyst 7, Version 2 

• Analyst goal: Next, refine the discrimination of restau- 
rants as good and bad based on their popularity. 

• Query: From 4SQ, obtain the checkin count for every 
restaurant. 

Threshold on count above a::3. Join with the good restau- 
rants from the previous version. 

Threshold on count below xa_. Join with the bad restau- 
rants from the previous version. 

A.7. 3 Analyst 7, Version 3 

• Analyst goal: Further discriminate restaurants as good 
and bad based repeat checkins. 

• Query: From 4SQ, obtain the checkin count for every 
restaurant. For each restaurant, count the number of 
users that checkedin only once, and the number of users 



that checkedin more than xt times. Compute the ratio of 
single checkins to multiple checkins. 

Threshold on ratio below x^. Join with the good restau- 
rants from the previous version. 

Threshold on ratio above xi. Join with the bad restau- 
rants from the previous version. 

A. 7.4 Analyst 7, Version 4 

• Analyst goal: Next, for each restaurant, find the most 
frequent entities with positive and negative comments. 

• Query: From 4SQ, apply UDF-NLP-ENTITITY- 
SENTIMENT per user checkin. For each restaurant and 
each entity, aggregate the sentiment-score. Threshold on 
sentiment-score above a^. 

Join with good and bad restaurants from previous version. 

A.8 Analyst 8 

Analysts wants to recommend a high-end hotel vacation 
in an area users will like based on their known preferences 
for restaurants, theaters, and luxury items. The evolutionary 
nature focuses on matching user's preferences with the types 
of businesses in an area. 

A.8. 1 Analyst 8, Version 1 

• Analyst goal: Find users who talk about 'luxury items'. 

• Query: From Twitter, apply UDF-CLASSIFY-LUXURY- 
SCORE on user tweets. For each user, count the number 
of tweets about luxury-items. Threshold on count above 

Xl. 

A. 8. 2 Analyst 8, Version 2 

• Analyst goal: Next, identify restaurants those users fre- 
quently visit. 

• Query: From 4SQ, for each user, count the number of 
checkins per restaurant. Threshold on count above X2- 
Join with users from previous version. 

For each restaurant, count the total number of checkins 
by all these users. Threshold on count above a^. 

A. 8. 3 Analyst 8, Version 3 

• Analyst goal: Next, find areas that have a high density of 
these restaurants and identify the distribution of restau- 
rant types in the area. 

• Query: For the restaurants from the previous version, ap- 
ply UDF-GRID-CELL. Count the number of restaurants 
per grid cell. Threshold on count above X4. 

For each grid cell, compute a histogram on the restaurant 
type and count. 

A.8.4 Analyst 8, Version 4 

• Analyst goal: Finally, match users to grid cells and find 
luxury hotels in their matching grid cell. 

• Query: For each user u from previous version, identify 
location, and compute a histogram on the restaurant type 
and u's checkin count. 

Match M to a grid cell such that the grid cell is sufficiently 
far away from u's location, and there is a significant over- 
lap between u's histogram and grid-cell g histogram from 
previous version. 



From Landmarks, find hotels with rating greater than 
X5 stars, and apply UDF-GRID-CELL to convert hotel 
location to grid cell g' . 

For each user u, join grid cell g with g' to identify hotels 
in an area matching u's restaurant preferences. 



